# EECS 370 Classifying Cache Misses



#### Announcements

- Lab 10 due Wed
  - Lab 11 meets Fr 11/17 and M 11/27 (after break)
- P4 out
  - Due Thu 11/30
  - Check out simulator on website (not same as project... doesn't cache instrs)
- HW 4 out
  - Due Mon 12/4





#### What about cache for instructions

- We've been focusing on caching loads and stores (i.e. data)
- Instructions should be cached as well
- We have two choices:
  - Treat instruction fetches as normal data and allocate cache lines when fetched
  - 2. Create a second cache (called the instruction cache or ICache) which caches instructions only
    - More common in practice

How do you know which cache to use? What are advantages of a separate ICache?



#### Integrating Caches into Pipeline

- How are caches integrated into a pipelined implementation?
  - Replace instruction memory with Icache
  - Replace data memory with Dcache
- Issues:
  - Memory accesses now have variable latency
  - Both caches may miss at the same time



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



#### Improving our Caches

- If our cache is getting a lot of misses, how do we improve it?
  - Depends on why the misses occurring
  - Is the cache too small? Is the associativity too restrictive? Something else?
- A decent first step is to classify the types of missing we are observing



## Classifying Cache Misses

- Cache misses happen for 3\* reasons
  - The 3C's of Cache misses:
- Compulsory miss
  - We've never accessed this data before
- Capacity miss
  - Cache is not large enough to hold all the data
  - May have been avoided if we used a bigger cache
- Conflict miss
  - Cache is large enough to hold data, but was replaced due to overly restrictive associativity
  - May have been avoided if we used a higher-associative cache

<sup>\*</sup>On multi-core systems, there's a 4<sup>th</sup> C – take EECS 470/570 to learn more



## Classifying Cache Misses

- Scenario: run given program on system with N-way cache of size M
  - Identify each miss
- We can classify each miss in a program by simulating on 3 different caches
  - If miss still occurs in cache where size >= memory size: compulsory miss
  - Else, if miss occurs in fully associative cache of size M: capacity miss
  - Else, if miss occurs in N-way cache of size M (original cache): conflict miss



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



## 3C's Sample Problem

4 block

Consider a cache with the following configuration: write-allocate, total size is 64 bytes, block size is 16 bytes, and 2-way associative. The memory address size is 16 bits and byte-addressable. The replacement policy is LRU. The cache is empty at the start.

For the following memory accesses, indicate whether the reference is a hit or miss, and the type of a miss (compulsory, conflict, capacity)

block offset.

1b bits.

4 bit

4 bit

4 bit

4 biock.



64 bytes total, 16 byte 3 C's Practice Problem – 3 C's blocks, 2-way, 2 sets Det. Thie. **Infinite Address** 3Cs FA SA 3 0x00 M Compulsory M M 0x14 M M M お文人と文 0x27 M M M Compu Som **0x08** 0x38 M M Compulsory 0x4A Compsory M SA QZA Set o 0x18 0x27 1 <u>apacity</u> 0x0F M  $\mathcal{N}$ SET | 0x40  $\Lambda \Lambda$ 



## 3 C's Practice Problem – 3 C's

64 bytes total, 16 byte blocks, 2-way, 2 sets

since I only care about "hit or "miss" so. I don't care about block offset here.

b bytes > 4 bit

| Address      | Infinite | FA                | SA   | 3Cs       |
|--------------|----------|-------------------|------|-----------|
| 0x0 <b>0</b> | Miss     | Miss              | Miss | Comp      |
| 0x14         | 11.155   | Miss              | Miss | Conp      |
| 0x27         | Miss     | Miss              | M143 | Comp      |
| 0x08         | hit      | hit               | hit  | _         |
| 0x38         | Miss     | Misi              | Miss | Comp      |
| 0x4A         | Miss     | Miss              | Mi45 | Conp      |
| 0x18         | hit      | Miss              | hie  |           |
| 0x27         | hit      | make coche longer | Miss | Campairer |
| 0x0F         | hit      | 133 K             | M153 | Comparity |
| 0x40         | hit      | hit               | 145} | Comflict  |



## 3 C's Practice Problem – 3 C's

<u>Poll:</u> How many blocks will be in a 64 byte FA cache?

| Address | Infinite | FA | SA | 3Cs |
|---------|----------|----|----|-----|
| 0x00    | M        |    |    |     |
| 0x14    | M        |    |    |     |
| 0x27    | M        |    |    |     |
| 80x0    | Н        |    |    |     |
| 0x38    | M        |    |    |     |
| 0x4A    | M        |    |    |     |
| 0x18    | Н        |    |    |     |
| 0x27    | Н        |    |    |     |
| 0x0F    | Н        |    |    |     |
| 0x40    | Н        |    |    |     |

# $3 \ C's \ Practice \ Problem - 3 \ C's \ ^{64 \ bytes \ total, \ 16 \ byte}_{blocks, \ 2-way, \ 2 \ sets}$

| Address | Infinite | FA | SA | 3Cs |
|---------|----------|----|----|-----|
| 0x00    | M        | M  |    |     |
| 0x14    | M        | M  |    |     |
| 0x27    | M        | M  |    |     |
| 0x08    | Н        | Н  |    |     |
| 0x38    | M        | M  |    |     |
| 0x4A    | M        | M  |    |     |
| 0x18    | Н        | M  |    |     |
| 0x27    | Н        | M  |    |     |
| 0x0F    | Н        | M  |    |     |
| 0x40    | Н        | Н  |    |     |



## $3 \ C's \ Practice \ Problem - 3 \ C's \ ^{64 \ bytes \ total, \ 16 \ byte}_{blocks, \ 2-way, \ 2 \ sets}$

| Address | Infinite | FA | SA | 3Cs        |
|---------|----------|----|----|------------|
| 0x00    | M        | M  | M  | Compulsory |
| 0x14    | M        | M  | M  | Compulsory |
| 0x27    | M        | M  | M  | Compulsory |
| 0x08    | Н        | Н  | Н  |            |
| 0x38    | M        | M  | M  | Compulsory |
| 0x4A    | M        | M  | M  | Compulsory |
| 0x18    | Н        | M  | Н  |            |
| 0x27    | Н        | M  | M  | Capacity   |
| 0x0F    | Н        | M  | M  | Capacity   |
| 0x40    | Н        | Н  | M  | Conflict   |



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



#### How to reduce cache misses

#### Compulsory miss

- Reduce by increasing cache block size
  - Reduces total number of blocks for given cache size 🕾
- Or by using prefetching (guess we'll need data based on previous memory patterns - discussed more in EECS 470)

#### Capacity miss

- Reduce by building a bigger cache
  - Increase access latency ☺

#### Conflict miss

- Reduce by increasing associativity
  - Increase access latency / overheads 🕾



#### Cache Performance

- How does changing these parameters affect performance?
  - Cache size
  - Block size
  - Associativity



#### Cache Size

- Cache size in the total data (not including tag) capacity
  - bigger can exploit temporal locality better
  - not ALWAYS better
- Too large a cache adversely affects hit & miss latency
  - smaller is faster => bigger is slower
  - access time may degrade critical path
- Too small a cache
  - doesn't exploit temporal locality well
  - useful data replaced often
- Working set: the whole set of data executing application references
  - Within a time interval





#### Block size

- Block size is the data that is associated with an address tag
  - Sub-blocking: A block divided into multiple pieces (each with V bit)
    - Can improve "write" performance
    - Take 470 to learn more
- Too small blocks
  - don't exploit spatial locality well
  - have larger tag overhead
- Too large blocks
  - too few total # of blocks
    - likely-useless data transferred
    - Extra bandwidth/energy consumed





## Associativity

- How many blocks can map to the same index (or set)?
- Larger associativity
  - lower miss rate, less variation among programs
  - diminishing returns
- Smaller associativity
  - lower cost
  - faster hit time
    - Especially important for L1 caches





## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



#### Practice Problem 1: CPI with caches

The blaster application run on the LC2k with full data forwarding and all branches predicted not-taken has the following instruction frequencies:

45% R-type 20% Branches

15% Loads 20% Stores

In blaster, 40% of branches are taken and 50% of LWs are followed by an immediate use.

The I-cache has a miss rate of 3% and the Dycache has a miss rate of 6% (no overlapping of misses). On a miss, the main memory is accessed and has a latency of 100 ns. The clock frequency is 500 MHz.

$$1+0.2 \times 0.4 \times 3 + 0.15 \times 0.5 \times 1 + 1 \times 0.03 \times 50$$
 $+0.35 \times 0.06 \times 50$ 

$$\frac{1}{500\times10^6} = 0.5\times10^9 = 2n\xi.$$

$$100n\xi = 2n\xi = 50 \text{ cylle}.$$

#### Problem 1 Solution

The *blaster* application run on the LC2k with full data forwarding and all branches predicted not-taken has the following instruction frequencies:

45% R-type 20% Branches

15% Loads 20% Stores

In blaster, 40% of branches are taken and 50% of LWs are followed by an immediate use.

The I-cache has a miss rate of 3% and the D-cache has a miss rate of 6% (no overlapping of misses). On a miss, the main memory is accessed and has a latency of 100 ns. The clock frequency is 500 MHz.

What is the CPI of *blaster* on the LC2k?

```
Stalls per cache miss = 100 ns / 2ns = 50 cycles (500 Mhz \rightarrow 2ns cycle time)

CPI = 1 + data hazard stalls + control hazard stalls + icache stalls + dcache stalls

CPI = 1 + 0.15*0.50*1 + 0.20*0.40*3 + 1*0.03*50 + 0.35*0.06*50
```



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



- Say you have the following:  $2x10^{12}$  W  $1x10^{12}$  SW
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- How many bytes of memory would be read and written if:
  - We had no cache?  $(2+1) \times 10^{12} \times 4 = 12 \times 10^{12} = 1.2 \times 10^{13}$
  - We had a write-though cache with a no-write allocate policy?
  - We had a write-back cache with a write-allocate policy? (Assume 25% of all misses result in a dirty eviction)

```
For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12}

For sw: each time will update the memory: |x|o^{12} \times 4 = 4x|o^{12} \times 4 = 4x|o^{12}
```

- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- How many bytes of memory would be read and written if:
  - We had no cache?
  - We had a write-though cache with a no-write allocate policy?
  - We had a write-back cache with a write-allocate policy? (Assume 25% of all misses result in a dirty eviction)

```
t. load from eache:

full and dirty is 1: 2×10°2×32×0.05×0.25×2=1.6 Billion

miss: load from memory

not full on full and dirty is 0: 2×10°2×32×0.05×0.75 = 3.4 Billion

Sw | D hit; load from cache:

miss: load from memory

full and dirty is 1: 1×10°2×32×0.05×0.25×0.25 = 1.2 Billion.

not full or full and dirty is 0: 1×10°2×32×0.05×0.75 = 1.2 Billion.
```

- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Let's start with the no-cache case.
  - All stores go to memory and are 4 bytes each
    - Writes: 1 billion stores\* 4 bytes = 4 billion bytes
  - All loads go to memory and are 4 bytes each.
    - Reads: 2 billion loads\* 4 bytes = 8 billion bytes
- Write-through with no write-allocate?



- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Write-though, no allocate.
  - All stores still go to memory and are still 4 bytes each.
    - Writes: 1 billion stores\* 4 bytes = 4 billion bytes
  - Only loads that miss in the cache go to memory. But they read the full cache block.
    - Reads: 2 billion loads\* 0.05\* 32 bytes = 3.2 billion bytes



- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Write-though, no allocate.



- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Write-back, write-allocate (25% of all misses result in a dirty eviction)



- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Write-back, write-allocate (25% of all misses result in a dirty eviction)
  - Store misses result in a cache block being read.
    - Reads: 1 billion stores\* 0.05\* 32 bytes = 1.6 billion bytes
  - Load misses result in a cache block being read.
    - Reads: 2 billion loads\* 0.05\* 32 bytes = 3.2 billion bytes
  - So that is 4.8 billion bytes of data read.



- Say you have the following:
  - A program that generates 2 Billion loads and 1 Billion stores, each 4 bytes in size.
  - A cache with a 32-byte block which gets a 95% hit rate on that program.
- Write-back, write-allocate (25% of all misses result in a dirty eviction)
  - Store misses result in dirty eviction 1/4 of the time.
    - Reads: 1 billion stores\* 0.05\* 32 bytes\*(.25) = 0.4 billion bytes
  - Load misses result in a cache block being read.
    - Reads: 2 billion loads\* 0.05\* 32 bytes\*(.25) = 0.8 billion bytes



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



## Practice Problem 3: CPI w/ Caches 2

$$\frac{1}{2 \times 10^8} = \frac{1}{0.1 \times 10^5} = 5 \text{ ns} < 1 \text{ cycles}.$$

- Given a 200 MHz processor with 8KB instruction and data caches and a with memory access latency of 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% icache hit rate and a 90% dcache hit rate. On average, 30% of the instructions are loads or stores. The CPI of this system, if caches were ideal would be 1.
- Suppose you have 2 options for the next generation processor, which do you pick?  $ICP = 1 + 0.3 \times 0.1 \times 40 + 1 \times 0.05 \times 40 = 1 + 1.2 + 2 = 4.2$ 
  - Option 1: Double the clock frequency—assume this will increase your memory latency to 40 cycles. Also assume a base CPI of 1 can still be achieved after this change.  $4.2 \times 5 n_5 \times 0.5 = 10.5 n_5$
  - Option 2: Double the size of your caches, this will increase the instruction cache hit rate to 98% and the data cache hit rate to 95%. Assume the hit latency is still 1 cycle.  $I \varphi = 1 + 0.3 \times 0.05 \times 20 + 1 \times 0.02 \times 20 = 1 + 0.3 + 0.4 = 1.7 \times 50.5 = 85.65$



## Practice Problem 3: CPI w/ Caches 2

- Given a 200 MHz processor with 8KB instruction and data caches and a with memory access latency of 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% icache hit rate and a 90% dcache hit rate. On average, 30% of the instructions are loads or stores. The CPI of this system, if caches were ideal would be 1.
- Suppose you have 2 options for the next generation processor, which do you pick?
  - Option 1: Double the clock frequency—assume this will increase your memory latency to 40 cycles. Also assume a base CPI of 1 can still be achieved after this change.
  - Option 2: Double the size of your caches, this will increase the instruction cache
    hit rate to 98% and the data cache hit rate to 95%. Assume the hit latency is still 1
    cycle.



#### Practice Problem 3: Solution

Option 1: (double clock freq, base cycle time is 5 ns, so new cycle time is 2.5 ns)

```
CPI = baseCPI + IcacheStallCPI + DcacheStallCPI

CPI = 1.0 + 0.05*40 + 0.3*0.1*40 = 4.2
```

Execution time = 4.2 \* Ninstrs \* 2.5ns = 10.5ns \* Ninstrs

Option 2 (icache/dcache miss rates lowered to 2% and 5%)

```
CPI = baseCPI + IcacheStallCPI + DcacheStallCPI

CPI = 1.0 + 0.02*20 + 0.3*0.05*20 = 1.7
```

Execution time = 1.7 \* Ninstrs \* 5ns = 8.5ns \* Ninstrs

Therefore, Option 2 is the better choice



## Agenda

- Motivation
- Example
- How to optimize cache design
- Practice Problem 1
- Practice Problem 2
- Practice Problem 3
- Practice Problem 4



Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit

```
addresses

00110000 1111

00210000 1111

00210000 1111

00210000 1111
0x30f - Miss block size cont be too large.
0x510 - Miss
9x31f - Hit 0011 0001 0000
                                      block size >16
0x72d - Miss
0x72f - Hit
                                                       Block size: ? 16
0x320 - Miss
                                                       Associativity: ?
0x520 - Miss
                                                       Number of sets: ?
0x720 - Miss
```



Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit addresses humber of blocks: 29/24 = 25 lines 1 I (hex) Since 0x5|0 will replace 0x310.  $\frac{2^5}{2} = 2^4$  Sets and 0x31f is imposible to have hit. - Miss ♂ 0x30f - Miss 0x510 2) Could it be two-way cache? - Hit 0x31**f** 图个Set里的个way. 1 bit tay. 4 bits set index. Start with directly map couch - Miss <sup>2</sup> - Hit 0x72**f** Block size: ? 16 Miss > Miss 2 Number of sets: ? Miss\_

Similar to homework! Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit addresses

```
0x310 - Miss
```

$$0x72f - Hit$$

$$0x520 - Miss$$



Similar to homework! Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit addresses

```
0x310 - Miss
```

$$0x72f - Hit$$

$$0x520 - Miss$$



Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit addresses

0x310 – Miss



0x30f – Miss







0x72d - Miss

0x72f - Hit

0x320 – Miss

0x520 - Miss

0x720 - Miss

#### Determine block size

First hit must be brought in by another miss

Take closest address: 0x310, so know block size must be at least 16 bytes so 0x31f brought in when 0x310 miss occurs

Now, is the block size larger? Know that 0x30f was a miss, thus 0x310 and 0x30f not in the same block. Thus, block size must be <= 16 bytes

Thus Block Size = 16 bytes



Here is the series of address references (in hex) to a cache of size 512 bytes. You are asked to **determine the configuration of the cache**. Assume 12-bit addresses

0x310 - Miss



0x30f - Miss





0x31f - Hit













Assume direct mapped: 3-bit tag, 5-bit index, 4-bit offset. If DM, 0x310 and 0x510 would both map to index 17, Thus 0x31f could not be a hit. So, not direct mapped.

Assume 2-way associative: 4-bit tag, 4-bit index, 4-bit offset This fixes the green accesses, and allows 0x31f to be a hit.

What about > 2-way associative?

Now we also know that 0x720 is a miss even though 3 accesses earlier 0x72f was a hit, and thus it is in the cache. The intervening 2 accesses must kick it out, 0x320 and 0x520. Both go to set 2. If the associativity was > 2, then 0x720 would be a hit. So, must conclude that cache is 2-way associative.

#### Next time

• Completing the hierarchy: virtual memory

